https://data.cityofnewyork.us/Transportation/2015-Yellow-Taxi-Trip-Data/ba8s-jw6u
The original data set contained records of all yellow ride in NYC between January and June 2015. To make the size of the data set more managable, we restricted the data set to a random 10% of rides occuring on May 6th (a randomly chosen weekday) that were paid with credit card. The data set included many specific details about the ride (e.g. number of passengers, rates, amount paid, etc.) We focus exploring 3 features of this dataset: time (dropoff_datetime and pickup_datetime), location (dropoff_latitude, dropoff_longitude, pickup_latitude, and pickup_longitude), and payments, specifically tips (tip_amount).
We are particularly interested in exploring 3 features of this taxi trips dataset: time (dropoff_datetime and pickup_datetime), location (dropoff_latitude, dropoff_longitude, pickup_latitude, and pickup_longitude), and payments, specifically tips (tip_amount).
ggplot(taxi, aes(x = as.factor(passenger_count), y = tip_pct)) +
geom_boxplot() +
labs(title = "Percentage Tipped by Number of Passengers", x =
"Number of Passengers", y = "Percentage Tipped")
This plot demonstrates that the tip percentage does not appear to change with the addition of more passengers. The average tip appears to be around 20% which is considered a standard tip. We need to further explore whether tip percentage is affected by location or distance, which we do in the regression below. Furthermore, we will explore where larger groups tend to travel. For example, we predict that larger groups may travel to more touristy areas.
ggplot(taxi, aes(x = avg_speed, y = tip_pct)) +
geom_smooth(method = lm) +
labs(title = "Percentage Tipped by Average Speed", x =
"Average Speed (mph)", y = "Percentage Tipped")
This plot shows that there is a positive correlation between avg_speed and tip_pct, meaning that passengers tip more if they get to where they are going more quickly. tip_pct increases about 0.5% when going from an avg_speed of 0 to 50 mpg.
tip_pct## (Intercept) pickup_hour pickup_latitude pickup_longitude dropoff_hour
## 1 TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE FALSE
## 3 TRUE FALSE FALSE FALSE FALSE
## 4 TRUE FALSE FALSE FALSE FALSE
## 5 TRUE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE FALSE TRUE FALSE
## 7 TRUE TRUE TRUE TRUE FALSE
## 8 TRUE TRUE TRUE TRUE FALSE
## 9 TRUE TRUE TRUE TRUE FALSE
## 10 TRUE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE TRUE TRUE TRUE
## dropoff_latitude dropoff_longitude passenger_count tolls_amount extra
## 1 FALSE FALSE FALSE FALSE TRUE
## 2 FALSE FALSE FALSE TRUE TRUE
## 3 FALSE TRUE FALSE TRUE TRUE
## 4 FALSE TRUE FALSE TRUE TRUE
## 5 FALSE TRUE FALSE TRUE TRUE
## 6 FALSE TRUE FALSE TRUE TRUE
## 7 FALSE TRUE FALSE TRUE TRUE
## 8 FALSE TRUE TRUE TRUE TRUE
## 9 FALSE TRUE TRUE TRUE TRUE
## 10 FALSE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE TRUE TRUE TRUE
## vendor_id avg_speed
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE TRUE
## 5 FALSE TRUE
## 6 FALSE TRUE
## 7 FALSE TRUE
## 8 FALSE TRUE
## 9 TRUE TRUE
## 10 TRUE TRUE
## 11 TRUE TRUE
## [1] 5
## [1] "pickup_hour" "dropoff_longitude" "tolls_amount"
## [4] "extra" "avg_speed"
plot(r1, scale="Cp", main = "Most Important Predictors in Different Size Models")
mbest <- lm(tip_pct ~ pickup_hour + tolls_amount + extra + avg_speed,
data = taxi)
summary(mbest)
##
## Call:
## lm(formula = tip_pct ~ pickup_hour + tolls_amount + extra + avg_speed,
## data = taxi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.860 -3.078 1.083 3.193 30.247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.426789 0.150757 135.495 < 2e-16 ***
## pickup_hour 0.020980 0.008366 2.508 0.0122 *
## tolls_amount 0.434221 0.041156 10.551 < 2e-16 ***
## extra 1.981691 0.135966 14.575 < 2e-16 ***
## avg_speed -0.039721 0.009136 -4.348 1.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.396 on 26823 degrees of freedom
## Multiple R-squared: 0.01608, Adjusted R-squared: 0.01593
## F-statistic: 109.6 on 4 and 26823 DF, p-value: < 2.2e-16
We will create regions of NYC, defined by the given latitudes and longitudes, in order to see if there are certain areas that tip more or less than others, rather than simply including latitudes and longitudes as numeric, linear predictors.
Additionally, we will explore interactions between various variables. We will look at interaction plots to determine which variables to interact in the subsequent model.
locationPlot(taxi[taxi$tip_pct <= 50,], taxi$tip_pct, "dropoff",
title = "Average Tip by Region", varname = "Tip Pct")
## Map from URL : http://tile.stamen.com/watercolor/12/1204/1537.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1205/1537.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1206/1537.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1207/1537.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1204/1538.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1205/1538.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1206/1538.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1207/1538.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1204/1539.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1205/1539.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1206/1539.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1207/1539.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1204/1540.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1205/1540.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1206/1540.jpg
## Map from URL : http://tile.stamen.com/watercolor/12/1207/1540.jpg
## Warning: Ignoring unknown aesthetics: x, y
This plot shows the average tip percentage in different areas of NYC where passengers were dropped off. We can see that in the central of Manhattan, most people tip around the standard amount of 20%. As you venture outward, there is more variability. We will explore this variability more in the final report.
## Start: AIC=29323.21
## tip25 ~ pickup_hour + pickup_latitude + pickup_longitude + dropoff_hour +
## dropoff_latitude + dropoff_longitude + passenger_count +
## tolls_amount + extra + vendor_id + avg_speed
##
## Df Deviance AIC
## - pickup_hour 1 29299 29321
## - vendor_id 1 29299 29321
## - avg_speed 1 29299 29321
## - dropoff_hour 1 29300 29322
## - dropoff_latitude 1 29300 29322
## - passenger_count 1 29300 29322
## <none> 29299 29323
## - pickup_longitude 1 29302 29324
## - dropoff_longitude 1 29302 29324
## - pickup_latitude 1 29305 29327
## - tolls_amount 1 29362 29384
## - extra 1 29709 29731
##
## Step: AIC=29321.21
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_latitude +
## dropoff_longitude + passenger_count + tolls_amount + extra +
## vendor_id + avg_speed
##
## Df Deviance AIC
## - vendor_id 1 29299 29319
## - avg_speed 1 29299 29319
## - dropoff_latitude 1 29300 29320
## - passenger_count 1 29300 29320
## - dropoff_hour 1 29301 29321
## <none> 29299 29321
## - pickup_longitude 1 29302 29322
## - dropoff_longitude 1 29302 29322
## - pickup_latitude 1 29305 29325
## - tolls_amount 1 29362 29382
## - extra 1 29716 29736
##
## Step: AIC=29319.25
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_latitude +
## dropoff_longitude + passenger_count + tolls_amount + extra +
## avg_speed
##
## Df Deviance AIC
## - avg_speed 1 29300 29318
## - dropoff_latitude 1 29300 29318
## - passenger_count 1 29300 29318
## - dropoff_hour 1 29301 29319
## <none> 29299 29319
## - pickup_longitude 1 29302 29320
## - dropoff_longitude 1 29302 29320
## - pickup_latitude 1 29305 29323
## - tolls_amount 1 29362 29380
## - extra 1 29716 29734
##
## Step: AIC=29317.49
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_latitude +
## dropoff_longitude + passenger_count + tolls_amount + extra
##
## Df Deviance AIC
## - dropoff_latitude 1 29300 29316
## - passenger_count 1 29301 29317
## - dropoff_hour 1 29301 29317
## <none> 29300 29318
## - pickup_longitude 1 29302 29318
## - dropoff_longitude 1 29303 29319
## - pickup_latitude 1 29305 29321
## - tolls_amount 1 29364 29380
## - extra 1 29723 29739
##
## Step: AIC=29316.29
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_longitude +
## passenger_count + tolls_amount + extra
##
## Df Deviance AIC
## - passenger_count 1 29301 29315
## - dropoff_hour 1 29302 29316
## <none> 29300 29316
## - pickup_longitude 1 29303 29317
## - dropoff_longitude 1 29303 29317
## - pickup_latitude 1 29309 29323
## - tolls_amount 1 29364 29378
## - extra 1 29724 29738
##
## Step: AIC=29315.43
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_longitude +
## tolls_amount + extra
##
## Df Deviance AIC
## - dropoff_hour 1 29303 29315
## <none> 29301 29315
## - pickup_longitude 1 29304 29316
## - dropoff_longitude 1 29304 29316
## - pickup_latitude 1 29310 29322
## - tolls_amount 1 29365 29377
## - extra 1 29725 29737
##
## Step: AIC=29315.18
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_longitude +
## tolls_amount + extra
##
## Df Deviance AIC
## <none> 29303 29315
## - dropoff_longitude 1 29306 29316
## - pickup_longitude 1 29306 29316
## - pickup_latitude 1 29312 29322
## - tolls_amount 1 29366 29376
## - extra 1 29819 29829
m25best <- glm(tip25 ~ dropoff_longitude + pickup_longitude + pickup_latitude +
tolls_amount + extra, data = taxi, family = binomial)
summary(m25best)
##
## Call:
## glm(formula = tip25 ~ dropoff_longitude + pickup_longitude +
## pickup_latitude + tolls_amount + extra, family = binomial,
## data = taxi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6554 -0.7712 -0.6494 -0.6098 1.9651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -262.31857 97.04839 -2.703 0.00687 **
## dropoff_longitude -0.96970 0.60696 -1.598 0.11012
## pickup_longitude -1.23720 0.75771 -1.633 0.10251
## pickup_latitude 2.39441 0.81431 2.940 0.00328 **
## tolls_amount 0.10864 0.01358 8.000 1.24e-15 ***
## extra 0.81890 0.03603 22.728 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 29888 on 26827 degrees of freedom
## Residual deviance: 29303 on 26822 degrees of freedom
## AIC: 29315
##
## Number of Fisher Scoring iterations: 4
This model, m25best, tries to predict whether or not a passenger tips more than 25% using 5 variables (dropoff_longitude, pickup_longitude, pickup_latitude, tolls_amount, and extra) that were deemed most significant using a backwards stepwise regression. pickup_latitude has the biggest coefficient, 2.394, meaning that for every 1 unit increase in pickup latitude (move to the North), the log odds of a person tipping 25% increase by 2.394, or the odds are multiplied by \(e^2.394 = 10.957\).
## Start: AIC=29323.21
## tip25 ~ pickup_hour + pickup_latitude + pickup_longitude + dropoff_hour +
## dropoff_latitude + dropoff_longitude + passenger_count +
## tolls_amount + extra + vendor_id + avg_speed
##
## Df Deviance AIC
## - pickup_hour 1 29299 29321
## - vendor_id 1 29299 29321
## - avg_speed 1 29299 29321
## - dropoff_hour 1 29300 29322
## - dropoff_latitude 1 29300 29322
## - passenger_count 1 29300 29322
## <none> 29299 29323
## - pickup_longitude 1 29302 29324
## - dropoff_longitude 1 29302 29324
## - pickup_latitude 1 29305 29327
## - tolls_amount 1 29362 29384
## - extra 1 29709 29731
##
## Step: AIC=29321.21
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_latitude +
## dropoff_longitude + passenger_count + tolls_amount + extra +
## vendor_id + avg_speed
##
## Df Deviance AIC
## - vendor_id 1 29299 29319
## - avg_speed 1 29299 29319
## - dropoff_latitude 1 29300 29320
## - passenger_count 1 29300 29320
## - dropoff_hour 1 29301 29321
## <none> 29299 29321
## - pickup_longitude 1 29302 29322
## - dropoff_longitude 1 29302 29322
## - pickup_latitude 1 29305 29325
## - tolls_amount 1 29362 29382
## - extra 1 29716 29736
##
## Step: AIC=29319.25
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_latitude +
## dropoff_longitude + passenger_count + tolls_amount + extra +
## avg_speed
##
## Df Deviance AIC
## - avg_speed 1 29300 29318
## - dropoff_latitude 1 29300 29318
## - passenger_count 1 29300 29318
## - dropoff_hour 1 29301 29319
## <none> 29299 29319
## - pickup_longitude 1 29302 29320
## - dropoff_longitude 1 29302 29320
## - pickup_latitude 1 29305 29323
## - tolls_amount 1 29362 29380
## - extra 1 29716 29734
##
## Step: AIC=29317.49
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_latitude +
## dropoff_longitude + passenger_count + tolls_amount + extra
##
## Df Deviance AIC
## - dropoff_latitude 1 29300 29316
## - passenger_count 1 29301 29317
## - dropoff_hour 1 29301 29317
## <none> 29300 29318
## - pickup_longitude 1 29302 29318
## - dropoff_longitude 1 29303 29319
## - pickup_latitude 1 29305 29321
## - tolls_amount 1 29364 29380
## - extra 1 29723 29739
##
## Step: AIC=29316.29
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_longitude +
## passenger_count + tolls_amount + extra
##
## Df Deviance AIC
## - passenger_count 1 29301 29315
## - dropoff_hour 1 29302 29316
## <none> 29300 29316
## - pickup_longitude 1 29303 29317
## - dropoff_longitude 1 29303 29317
## - pickup_latitude 1 29309 29323
## - tolls_amount 1 29364 29378
## - extra 1 29724 29738
##
## Step: AIC=29315.43
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_hour + dropoff_longitude +
## tolls_amount + extra
##
## Df Deviance AIC
## - dropoff_hour 1 29303 29315
## <none> 29301 29315
## - pickup_longitude 1 29304 29316
## - dropoff_longitude 1 29304 29316
## - pickup_latitude 1 29310 29322
## - tolls_amount 1 29365 29377
## - extra 1 29725 29737
##
## Step: AIC=29315.18
## tip25 ~ pickup_latitude + pickup_longitude + dropoff_longitude +
## tolls_amount + extra
##
## Df Deviance AIC
## <none> 29303 29315
## - dropoff_longitude 1 29306 29316
## - pickup_longitude 1 29306 29316
## - pickup_latitude 1 29312 29322
## - tolls_amount 1 29366 29376
## - extra 1 29819 29829
##
## Call:
## glm(formula = tip15 ~ pickup_latitude + tolls_amount + extra +
## avg_speed, family = binomial, data = taxi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7804 -0.6676 -0.6471 -0.5985 2.3511
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -22.803415 28.980839 -0.787 0.43137
## pickup_latitude 0.523905 0.711099 0.737 0.46127
## tolls_amount -0.088687 0.016259 -5.455 4.91e-08 ***
## extra -0.227505 0.041408 -5.494 3.93e-08 ***
## avg_speed 0.009195 0.003096 2.970 0.00298 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26076 on 26827 degrees of freedom
## Residual deviance: 26016 on 26823 degrees of freedom
## AIC: 26026
##
## Number of Fisher Scoring iterations: 4
grid.arrange(timeplot, pop_map, ncol = 2, widths = c(2,1))
The above plot shows the most popular taxi pickup and taxi dropoff sites in Manhattan. Regions 1-5 are the most popular pickup sites (1 = most popular, decerasing order). Regions 6-10 are the most popular dropoff sites (6 = most popular, decreasing order). The map on the right shows the geographic location of these sites, while the plot on the left show the number of pickups/dropoffs at these sites during each hour of the day. As can be seen, dropoff sites tend be most active around 7-8 am. In contrast, pickup sites are most active around 8-9 pm. This analysis will be extended expanding including all taxi rides for the day (rather than just 10%, as is currently done). Furthermore, we might investigate how the most popular pickup sites rank in terms of dropoff sites and vice versa.
MAXLAT1 <- 40.8
MINLAT1 <- 40.7
MAXLONG1 <- -73.92
MINLONG1<- -74.03
MINRIDES <- 30 # minimum number of rides for arrow to show up
map <- get_map(location = c(MINLONG, MINLAT, MAXLONG, MAXLAT), source = "stamen",
maptype = "watercolor", color = "bw")
ggmap(map) +
geom_segment(aes(x = avg_long, xend = (avg_long + long_dir_avg),
y = avg_lat, yend = (avg_lat + lat_dir_avg),
color = count), data = filter(grouped, count > MINRIDES),
arrow = arrow(length=unit(0.15, "cm"), ends="first",
type = "closed"), size = 0.9) + # adds direction vectors
labs(title = "Direction of Motion by Region",
x = expression(italic("Longitude")~"(degrees)"),
y = expression(italic("Latitude")~"(degrees)")) +
scale_color_gradientn("Number of Rides",
colors = c("purple","blue","green","yellow","orange","red")) +
facet_wrap( ~ daytime) +
scale_y_continuous(limits=c(MINLAT1, MAXLAT1)) +
scale_x_continuous(limits = c(MINLONG1, MAXLONG1))
This plot shows the average direction of travel for taxis with pickup locations in different regions of NYC. This was calculated by averaging the normalized direction vectors of all rides starting within a region. As can be be seen, there is a general outward flux of taxis both in the morning and in the evening. Interestingly, taxi rides starting farther from the center of tend to agree more on the outward direction (as indicated by longer arrows). The colors represent the number of rides starting in each region.
To extend this plot, we are considering drawing confidence ellipses around each arrow, although we are not quite sure yet about how the code for this would work. Furthermore, we will facet the plot into more time catagories to get a better view of traffic flow in. Finally we might investigate where exacly the outward directed taxis starting grom the periphery of Manhattan are headed.
This is another analysis option. To do this, we would create a heat map of Manhattan where different colors indicate different travel speeds of taxis.